Results 1 - 20 of 130,462
1.
PLoS One ; 19(4): e0300545, 2024.
Article in English | MEDLINE | ID: mdl-38558075

ABSTRACT

Short tandem repeat (STR) variation is an often overlooked source of variation between genomes. STRs comprise about 3% of the human genome and are highly polymorphic. Some cause Mendelian disease, and others affect gene expression. Their contribution to common disease is not well understood, but recent software tools designed to genotype STRs using short-read sequencing data will help address this. Here, we compare software that genotypes common STRs and rarer STR expansions genome-wide, with the aim of applying them to population-scale genomes. Using Genome in a Bottle (GIAB) consortium and 1000 Genomes Project short-read sequencing data, we compare performance in terms of sequence length, depth, computing resources needed, genotyping accuracy and number of STRs genotyped. To ensure broad applicability of our findings, we also measure genotyping performance against a set of genomes from clinical samples with known STR expansions, and a set of STRs commonly used for forensic identification. We find that HipSTR, ExpansionHunter and GangSTR perform well in genotyping common STRs, including the CODIS 13 core STRs used for forensic analysis. GangSTR and ExpansionHunter outperform HipSTR for genotyping call rate and memory usage. ExpansionHunter Denovo (EHdn), STRling and GangSTR outperform STRetch for detecting expanded STRs, and EHdn and STRling use considerably less processor time than GangSTR. Analysis of shared genomic sequence data provided by the GIAB consortium allows future performance comparisons of new software approaches on a common set of data, facilitating comparisons and allowing researchers to choose the software that best fulfils their needs.
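The genotyping call rate and concordance metrics used in comparisons like this can be sketched in a few lines. The sketch below is illustrative only: the loci, repeat-count genotypes, and dict-based interface are hypothetical, not any of the above tools' actual output formats.

```python
def genotyping_metrics(calls, truth):
    """Compare STR genotype calls against a truth set.

    calls: dict mapping locus -> (allele1, allele2) repeat counts, or absent if no call
    truth: dict mapping locus -> (allele1, allele2)
    Returns (call_rate, concordance) over loci present in the truth set.
    Alleles are compared as unordered pairs, since phase is usually not asserted.
    """
    attempted = 0
    correct = 0
    for locus, true_gt in truth.items():
        call = calls.get(locus)
        if call is None:
            continue
        attempted += 1
        if sorted(call) == sorted(true_gt):
            correct += 1
    call_rate = attempted / len(truth) if truth else 0.0
    concordance = correct / attempted if attempted else 0.0
    return call_rate, concordance

# Toy example with hypothetical loci; one locus is left uncalled
truth = {"chr1:100": (12, 14), "chr2:200": (9, 9), "chr3:300": (30, 31)}
calls = {"chr1:100": (14, 12), "chr2:200": (9, 10)}
print(genotyping_metrics(calls, truth))
```

In benchmarks of this kind, the truth genotypes typically come from GIAB reference materials.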


Subject(s)
Genome, Human , Microsatellite Repeats , Humans , Microsatellite Repeats/genetics , Software , Genomics , Genotype , High-Throughput Nucleotide Sequencing
2.
PeerJ ; 12: e17184, 2024.
Article in English | MEDLINE | ID: mdl-38560451

ABSTRACT

Background: Single-cell annotation plays a crucial role in the analysis of single-cell genomics data. Despite the existence of numerous single-cell annotation algorithms, a comprehensive tool for integrating and comparing these algorithms is still lacking. Methods: This study investigated a range of widely adopted single-cell annotation algorithms. Ten single-cell annotation algorithms were selected based on the classification of either reference dataset-dependent or marker gene-dependent approaches. These algorithms included SingleR, Seurat, sciBet, scmap, CHETAH, scSorter, sc.type, cellID, scCATCH, and SCINA. Building upon these algorithms, we developed an R package named scAnnoX for the integration and comparative analysis of single-cell annotation algorithms. Results: The scAnnoX software package provides a cohesive framework for annotating cells in scRNA-seq data, enabling researchers to more efficiently compare the cell type annotations produced for scRNA-seq datasets. The integrated environment of scAnnoX streamlines the testing, evaluation, and comparison of the various algorithms. Among the ten annotation tools evaluated, SingleR, Seurat, sciBet, and scSorter emerged as top performers in terms of prediction accuracy, with SingleR and sciBet demonstrating particularly superior performance, offering guidance for users. Interested parties can access the scAnnoX package at https://github.com/XQ-hub/scAnnoX.
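As a rough illustration of the reference dataset-dependent approach used by tools such as SingleR, each cell can be assigned the label of the reference profile it correlates with best. The sketch below is plain Python with hypothetical expression profiles, not the R API of any of the packages above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def annotate(cell, reference):
    """Label a cell with the reference profile it correlates with best.

    cell: list of expression values; reference: dict label -> profile list.
    """
    return max(reference, key=lambda label: pearson(cell, reference[label]))

# Hypothetical reference profiles over four genes
reference = {"T cell": [9, 1, 0, 5], "B cell": [0, 8, 7, 1]}
print(annotate([8, 2, 1, 4], reference))  # best match: "T cell"
```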


Subject(s)
Single-Cell Analysis , Software , Algorithms , Genomics
3.
J Immunol ; 212(8): 1255, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38560812
4.
PeerJ ; 12: e17133, 2024.
Article in English | MEDLINE | ID: mdl-38563009

ABSTRACT

Background: In the current era of rapid technological innovation, our lives are becoming more closely intertwined with digital systems. Consequently, every human action generates a valuable repository of digital data. In this context, data-driven architectures are pivotal for organizing, manipulating, and presenting data to facilitate positive computing through ensemble machine learning models. Moreover, the COVID-19 pandemic underscored a substantial need for a flexible mental health care architecture. This architecture, inclusive of machine learning predictive models, has the potential to benefit a larger population by identifying individuals at a heightened risk of developing various mental disorders. Objective: Therefore, this research aims to create a flexible mental health care architecture that leverages data-driven methodologies and ensemble machine learning models. The objective is to proficiently structure, process, and present data for positive computing. The adaptive data-driven architecture facilitates customized interventions for diverse mental disorders, fostering positive computing. Consequently, improved mental health care outcomes and enhanced accessibility for individuals with varied mental health conditions are anticipated. Method: Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, the researchers conducted a systematic literature review in databases indexed in Web of Science to identify the existing strengths and limitations of software architecture relevant to our adaptive design. The systematic review was registered in PROSPERO (CRD42023444661). Additionally, a mapping process was employed to derive essential paradigms serving as the foundation for the research architectural design. To validate the architecture based on its features, professional experts utilized a Likert scale. Results: Through the review, the authors identified six fundamental paradigms crucial for designing architecture. 
Leveraging these paradigms, the authors crafted an adaptive data-driven architecture, subsequently validated by professional experts. The validation resulted in a mean score exceeding four for each evaluated feature, confirming the architecture's effectiveness. To further assess the architecture's practical application, a prototype architecture for predicting pandemic anxiety was developed.


Subject(s)
Mental Health , Pandemics , Humans , Software , Machine Learning , Anxiety Disorders
5.
BMC Bioinformatics ; 25(1): 142, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38566005

ABSTRACT

BACKGROUND: The rapid advancement of new genomic sequencing technology has enabled the development of multi-omic single-cell sequencing assays. These assays profile multiple modalities in the same cell and can often yield new insights not revealed with a single modality. For example, Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-Seq) simultaneously profiles the RNA transcriptome and surface protein expression. The surface protein markers in CITE-Seq can be used to identify cell populations, in a manner similar to the iterative filtration process in flow cytometry known as "gating", and this is an essential step for downstream analyses and data interpretation. While several packages allow users to interactively gate cells, they often do not process multi-omic sequencing datasets and may require writing redundant code to specify gate boundaries. To streamline the gating process, we developed CITEViz, which allows users to interactively gate cells in Seurat-processed CITE-Seq data. CITEViz can also visualize basic quality control (QC) metrics, allowing for a rapid and holistic evaluation of CITE-Seq data. RESULTS: We applied CITEViz to a peripheral blood mononuclear cell CITE-Seq dataset and gated for several major blood cell populations (CD14 monocytes, CD4 T cells, CD8 T cells, NK cells, B cells, and platelets) using canonical surface protein markers. The visualization features of CITEViz were used to investigate cellular heterogeneity in CD14- and CD16-expressing monocytes and to detect differences in the number of detected antibodies per patient donor. These results highlight the utility of CITEViz in enabling the robust classification of single-cell populations. CONCLUSIONS: CITEViz is an R-Shiny app that standardizes the gating workflow in CITE-Seq data for efficient classification of cell populations. Its secondary function is to generate basic feature plots and QC figures specific to multi-omic data.
The user interface and internal workflow of CITEViz uniquely work together to produce an organized workflow and sensible data structures for easy data retrieval. This package leverages the strengths of biologists and computational scientists to assess and analyze multi-omic single-cell datasets. In conclusion, CITEViz streamlines the flow cytometry gating workflow in CITE-Seq data to help facilitate novel hypothesis generation.
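The iterative, threshold-based gating described above can be sketched as repeated filtering on surface-marker values. The marker names and cutoffs below are hypothetical, and CITEViz itself performs this interactively on Seurat objects rather than through code like this.

```python
def gate(cells, rules):
    """Return the cells passing every gate rule.

    cells: list of dicts mapping marker -> (e.g., normalized) expression value.
    rules: dict marker -> (op, threshold) with op in {">", "<"}.
    """
    def passes(cell):
        for marker, (op, thr) in rules.items():
            value = cell[marker]
            if op == ">" and not value > thr:
                return False
            if op == "<" and not value < thr:
                return False
        return True
    return [c for c in cells if passes(c)]

cells = [
    {"CD14": 5.1, "CD3": 0.2},
    {"CD14": 0.3, "CD3": 4.8},
    {"CD14": 4.2, "CD3": 0.1},
]
# Gate for CD14+ CD3- (monocyte-like) cells; thresholds are illustrative
monocytes = gate(cells, {"CD14": (">", 2.0), "CD3": ("<", 1.0)})
print(len(monocytes))  # 2
```

Nested gates, as in flow cytometry, amount to applying `gate` repeatedly to the surviving cells.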


Subject(s)
Leukocytes, Mononuclear , Software , Humans , Sequence Analysis, RNA/methods , Workflow , Flow Cytometry , Membrane Proteins , Single-Cell Analysis/methods , Gene Expression Profiling/methods
6.
JACC Cardiovasc Imaging ; 17(4): 428-440, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38569793

ABSTRACT

Structural heart disease interventions rely heavily on preprocedural planning and simulation to improve procedural outcomes and to predict and prevent potential procedural complications. Modeling technologies, namely 3-dimensional (3D) printing and computational modeling, are increasingly used to predict the interaction between cardiac anatomy and implantable devices. Such models play a role in patient education, operator training, procedural simulation, and appropriate device selection. However, current modeling is often limited by the replication of a single static configuration within a dynamic cardiac cycle. Recognizing that health systems may face technical and economic limitations to the creation of "in-house" 3D-printed models, structural heart teams are pivoting to the use of computational software for modeling purposes.


Subject(s)
Cardiac Surgical Procedures , Heart Diseases , Humans , Predictive Value of Tests , Cardiac Surgical Procedures/methods , Computer Simulation , Heart Diseases/diagnostic imaging , Heart Diseases/therapy , Software , Printing, Three-Dimensional
7.
AAPS J ; 26(3): 39, 2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38570385

ABSTRACT

A well-documented pharmacometric (PMx) analysis dataset specification ensures consistency in derivations of the variables, naming conventions, traceability to the source data, and reproducibility of the analysis dataset. Lack of standards in creating the dataset specification can lead to poor-quality analysis datasets, negatively impacting the quality of the PMx analysis. Standardization of the dataset specification within an individual organization helps address some of these inconsistencies. The recent introduction of the Clinical Data Interchange Standards Consortium (CDISC) Analysis Data Model (ADaM) Population Pharmacokinetic (popPK) Implementation Guide (IG) further promotes industry-wide standards by providing guidelines for the basic data structure of popPK analysis datasets. However, manual implementation of the standards can be labor intensive and error-prone. Hence, there is still a need to automate the implementation of these standards. In this paper, we present PmWebSpec, an easily deployable web-based application to facilitate the creation and management of CDISC-compliant PMx analysis dataset specifications. We describe the application of this tool through examples and highlight its key features, including pre-populated dataset specifications, built-in checks to enforce standards, and generation of an electronic Common Technical Document (eCTD)-compliant data definition file. The application increases efficiency and quality, semi-automates the creation of PMx analysis datasets and specifications, and has been well accepted by pharmacometricians and programmers internally. The success of this application suggests its potential for broader usage across the PMx community.
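A minimal sketch of the kind of built-in check such a tool might run, comparing a dataset's columns against a specification, is shown below. The column names and spec layout are illustrative only and are not taken from the ADaM popPK IG.

```python
def check_columns(dataset_columns, spec):
    """Check a dataset against a specification: report required columns
    that are missing, and dataset columns not defined in the spec."""
    required = {name for name, props in spec.items() if props.get("required")}
    missing = sorted(required - set(dataset_columns))
    undefined = sorted(set(dataset_columns) - set(spec))
    return missing, undefined

# Hypothetical popPK-style specification (names illustrative)
spec = {
    "USUBJID": {"required": True},
    "TIME": {"required": True},
    "DV": {"required": True},
    "COMMENT": {"required": False},
}
print(check_columns(["USUBJID", "TIME", "EXTRA"], spec))  # (['DV'], ['EXTRA'])
```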


Subject(s)
Software , Reproducibility of Results , Reference Standards
8.
Methods Mol Biol ; 2797: 67-90, 2024.
Article in English | MEDLINE | ID: mdl-38570453

ABSTRACT

Molecular docking is a popular computational tool in drug discovery. Leveraging structural information, docking software predicts binding poses of small molecules to cavities on the surfaces of proteins. Virtual screening for ligand discovery is a useful application of docking software. In this chapter, using the enigmatic KRAS protein as an example system, we endeavor to teach the reader about best practices for performing molecular docking with UCSF DOCK. We discuss methods for virtual screening and docking molecules on KRAS. We present the following six points to optimize our docking setup for prosecuting a virtual screen: protein structure choice, pocket selection, optimization of the scoring function, modification of sampling spheres and sampling procedures, choosing an appropriate portion of chemical space to dock, and the choice of which top scoring molecules to pick for purchase.
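The final step mentioned above, choosing top-scoring molecules, reduces to ranking by docking score. Below is a hedged sketch assuming a DOCK-style convention in which more negative scores are better; the molecule IDs are made up.

```python
def top_hits(scores, n):
    """Rank docked molecules by score (more negative = better, as in
    grid-score-like schemes) and return the n best molecule IDs."""
    return [mol for mol, s in sorted(scores.items(), key=lambda kv: kv[1])[:n]]

# Hypothetical docking scores from a small virtual screen
scores = {"ZINC001": -42.1, "ZINC002": -35.7, "ZINC003": -51.3, "ZINC004": -12.0}
print(top_hits(scores, 2))  # ['ZINC003', 'ZINC001']
```

In practice the picking step also involves filters (e.g., novelty and availability) beyond the raw score, as the chapter's discussion of purchase selection implies.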


Subject(s)
Algorithms , Proto-Oncogene Proteins p21(ras) , Molecular Docking Simulation , Proto-Oncogene Proteins p21(ras)/genetics , Proto-Oncogene Proteins p21(ras)/metabolism , Software , Proteins/chemistry , Drug Discovery , Ligands , Protein Binding , Binding Sites
9.
Cancer Discov ; 14(4): 625-629, 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38571426

ABSTRACT

SUMMARY: The transition from 2D to 3D spatial profiling marks a revolutionary era in cancer research, offering unprecedented potential to enhance cancer diagnosis and treatment. This commentary outlines the experimental and computational advancements and challenges in 3D spatial molecular profiling, underscoring the innovation needed in imaging tools, software, artificial intelligence, and machine learning to overcome implementation hurdles and harness the full potential of 3D analysis in the field.


Subject(s)
Artificial Intelligence , Neoplasms , Humans , Machine Learning , Software , Neoplasms/diagnosis , Neoplasms/genetics
10.
Br J Math Stat Psychol ; 77(2): 289-315, 2024 May.
Article in English | MEDLINE | ID: mdl-38591555

ABSTRACT

Popular statistical software provides the Bayesian information criterion (BIC) for multi-level models or linear mixed models. However, it has been observed that the combination of statistical literature and software documentation has led to discrepancies in the formulas of the BIC and uncertainties as to the proper use of the BIC in selecting a multi-level model with respect to level-specific fixed and random effects. These discrepancies and uncertainties result from different specifications of sample size in the BIC's penalty term for multi-level models. In this study, we derive the BIC's penalty term for level-specific fixed- and random-effect selection in a two-level nested design. In this new version of the BIC, called BIC_E1, the penalty term is decomposed into two parts if the random-effect variance-covariance matrix has full rank: (a) a term with the log of the average sample size per cluster and (b) the total number of parameters times the log of the total number of clusters. Furthermore, we derive the new version of the BIC, called BIC_E2, in the presence of redundant random effects. We show that the derived formulae, BIC_E1 and BIC_E2, adhere to empirical values via numerical demonstration and that BIC_E (E indicating either E1 or E2) is the best global selection criterion, as it performs at least as well as the BIC with the total sample size and the BIC with the number of clusters across various multi-level conditions in a simulation study. In addition, the use of BIC_E1 is illustrated with a textbook example dataset.
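For reference, the two conventional variants that the derived criteria are benchmarked against differ only in the sample size entering the penalty term. A sketch of those two baselines follows; the decomposed BIC_E1/BIC_E2 penalties derived in the paper are more involved and are not reproduced here.

```python
from math import log

def bic_total_n(loglik, n_params, total_n):
    """Conventional BIC using the total number of observations."""
    return -2 * loglik + n_params * log(total_n)

def bic_clusters(loglik, n_params, n_clusters):
    """BIC variant using the number of level-2 units (clusters)."""
    return -2 * loglik + n_params * log(n_clusters)

# Hypothetical fit: 50 clusters of 20 observations each
print(bic_total_n(-480.0, 6, 1000))  # penalty uses log(1000)
print(bic_clusters(-480.0, 6, 50))   # penalty uses log(50)
```

The choice between the two changes the penalty substantially (log(1000) vs. log(50) here), which is exactly the discrepancy the paper addresses.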


Subject(s)
Software , Sample Size , Bayes Theorem , Linear Models , Computer Simulation
11.
BMJ Health Care Inform ; 31(1), 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38575326

ABSTRACT

Objectives: The objective of this study was to explore the use of generative artificial intelligence (AI) for asking questions about sexual health among cancer survivors, a topic that is often challenging for patients to discuss. Methods: We employed the Generative Pre-trained Transformer-3.5 (GPT) as the generative AI platform and used DocsBot for citation retrieval (June 2023). A structured prompt was devised to generate 100 questions from the AI, based on epidemiological survey data regarding sexual difficulties among cancer survivors. These questions were submitted to Bot1 (standard GPT) and Bot2 (sourced from two clinical guidelines). Results: No censorship of sexual expressions or medical terms occurred. Despite the lack of reflection on guideline recommendations, 'consultation' was significantly more prevalent in both bots' responses compared with pharmacological interventions, with ORs of 47.3 (p<0.001) in Bot1 and 97.2 (p<0.001) in Bot2. Discussion: Generative AI can serve to provide health information on sensitive topics such as sexual health, despite the potential for policy-restricted content. Responses were biased towards non-pharmacological interventions, probably owing to the model provider's prohibition policy on replying to medical topics. This shift warrants attention, as it could potentially raise patients' expectations for non-pharmacological interventions.


Subject(s)
Health Communication , Neoplasms , Sexual Health , Humans , Artificial Intelligence , Software , Bias , Neoplasms/therapy
12.
BMJ Open Ophthalmol ; 9(1), 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38575345

ABSTRACT

OBJECTIVE: Preclinical validation study to assess the feasibility and accuracy of electromagnetic image-guided systems (EM-IGS) in orbital surgery using high-fidelity physical orbital anatomy simulators. METHODS: The EM-IGS platform, clinical software, navigation instruments and reference system (StealthStation S8, Medtronic) were evaluated in a mock operating theatre at the Royal Victoria Eye and Ear Hospital, a tertiary academic hospital in Dublin, Ireland. Five high-resolution 3D-printed model skulls were created using CT scans of five anonymised patients with an orbital tumour who previously had a successful orbital biopsy or excision. The ability of ophthalmic surgeons to achieve satisfactory system registration in each model was assessed. Subsequently, navigational accuracy was recorded using defined anatomical landmarks as ground truth. Qualitative feedback on the system was also obtained. RESULTS: Three independent surgeons participated in the study: one junior trainee, one fellow and one consultant. Across models, more senior participants were able to achieve a smaller system-generated registration error in fewer attempts. When assessing navigational accuracy, submillimetre accuracy was achieved for the majority of points (16 landmarks per model, per participant). Qualitative surgeon feedback suggested acceptability of the technology, although interference from mobile phones near the operative field was noted. CONCLUSION: This study suggests the feasibility and accuracy of EM-IGS for orbital surgery using patient-specific 3D-printed skulls. This preclinical study provides the foundation for clinical studies to explore the safety and effectiveness of this technology.


Subject(s)
Surgery, Computer-Assisted , Humans , Orbit/diagnostic imaging , Tomography, X-Ray Computed , Software , Electromagnetic Phenomena
13.
Genome Biol ; 25(1): 89, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-38589921

ABSTRACT

Advancements in cytometry technologies have enabled quantification of up to 50 proteins across millions of cells at single-cell resolution. Analysis of cytometry data routinely involves tasks such as data integration, clustering, and dimensionality reduction. While numerous tools exist, many require extensive run times when processing large cytometry datasets containing millions of cells. Existing solutions, such as random subsampling, are inadequate as they risk excluding rare cell subsets. To address this, we propose SuperCellCyto, an R package that builds on the SuperCell tool, which groups highly similar cells into supercells. SuperCellCyto is available on GitHub (https://github.com/phipsonlab/SuperCellCyto) and Zenodo (https://doi.org/10.5281/zenodo.10521294).
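The supercell idea, grouping highly similar cells and averaging their profiles so that downstream tools process far fewer rows, can be illustrated with a naive greedy sketch. This is not SuperCellCyto's actual algorithm, only the underlying concept.

```python
def make_supercells(cells, radius):
    """Greedily group cells within `radius` (Euclidean distance) of a
    seed cell into one supercell, returning averaged profiles.

    cells: list of equal-length numeric expression vectors.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    supercells, used = [], [False] * len(cells)
    for i, seed in enumerate(cells):
        if used[i]:
            continue
        members = [j for j in range(len(cells))
                   if not used[j] and dist(seed, cells[j]) <= radius]
        for j in members:
            used[j] = True
        profile = [sum(cells[j][k] for j in members) / len(members)
                   for k in range(len(seed))]
        supercells.append(profile)
    return supercells

# Two nearly identical cells collapse into one supercell; the outlier stays alone
cells = [[1.0, 1.0], [1.1, 0.9], [8.0, 8.0]]
print(make_supercells(cells, radius=0.5))
```

Unlike random subsampling, every cell contributes to some supercell, so rare subsets are not silently dropped, which is the motivation stated in the abstract.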


Subject(s)
Research , Single-Cell Analysis , Cluster Analysis , Software
14.
Sci Rep ; 14(1): 8348, 2024 Apr 09.
Article in English | MEDLINE | ID: mdl-38594373

ABSTRACT

Single molecule fluorescence in situ hybridisation (smFISH) has become a valuable tool to investigate the mRNA expression of single cells. However, it requires a considerable amount of programming expertise to use currently available open-source analytical software packages to extract and analyse quantitative data about transcript expression. Here, we present FISHtoFigure, a new software tool developed specifically for the analysis of mRNA abundance and co-expression in QuPath-quantified, multi-labelled smFISH data. FISHtoFigure facilitates the automated spatial analysis of transcripts of interest, allowing users to analyse populations of cells positive for specific combinations of mRNA targets without the need for computational image analysis expertise. As a proof of concept and to demonstrate the capabilities of this new research tool, we have validated FISHtoFigure in multiple biological systems. We used FISHtoFigure to identify an upregulation in the expression of Cd4 by T-cells in the spleens of mice infected with influenza A virus, before analysing more complex data showing crosstalk between microglia and regulatory B-cells in the brains of mice infected with Trypanosoma brucei brucei. These analyses demonstrate the ease of analysing cell expression profiles using FISHtoFigure and the value of this new tool in the field of smFISH data analysis.
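The core quantification described, counting cells positive for specific combinations of mRNA targets, can be sketched as follows. The spot-count threshold and per-cell data layout are hypothetical, not FISHtoFigure's actual interface.

```python
def coexpressing_cells(counts, targets, min_spots=1):
    """Count cells positive for every target in `targets`.

    counts: list of dicts mapping transcript name -> spot count per cell.
    A cell is positive for a target if it has at least `min_spots` spots.
    """
    return sum(
        1 for cell in counts
        if all(cell.get(t, 0) >= min_spots for t in targets)
    )

# Toy per-cell spot counts for two transcripts
cells = [
    {"Cd4": 12, "Cd19": 0},
    {"Cd4": 3, "Cd19": 5},
    {"Cd4": 0, "Cd19": 7},
]
print(coexpressing_cells(cells, ["Cd4"]))          # 2
print(coexpressing_cells(cells, ["Cd4", "Cd19"]))  # 1
```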


Subject(s)
Image Processing, Computer-Assisted , Software , Animals , Mice , RNA, Messenger/metabolism , In Situ Hybridization, Fluorescence/methods , Up-Regulation
15.
Angle Orthod ; 94(3): 346-352, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38639456

ABSTRACT

OBJECTIVES: To investigate the dimensional stability of various 3D-printed models derived from resin and plant-based, biodegradable plastics (PLA) under specific storage conditions for a period of up to 21 weeks. MATERIALS AND METHODS: Four different printing materials, including Draft V2, study model 2, and Ortho model OD01 resins as well as PLA mineral, were evaluated over a 21-week period. Eighty 3D-printed models were divided equally into two groups, with one group stored in darkness and the other exposed to daylight. All models were stored at a constant room temperature (20°C). Measurements were taken at 7-week intervals using the Inspect 3D module in OnyxCeph software (Image Instruments GmbH, Chemnitz, Germany). RESULTS: Dimensional change was noted for all of the models, with shrinkage of up to 0.26 mm over the study period. Most contraction occurred from baseline to T1, although significant further contraction also arose from T1 to T2 (P < .001) and T1 to T3 (P < .001). More shrinkage was observed when exposed to daylight, both overall and for each resin type (P < .01). The least shrinkage was noted with Ortho model OD01 resin (0.16 mm, SD = 0.06), and the highest level of shrinkage was observed for Draft V2 resin (0.23 mm, SD = 0.06; P < .001). CONCLUSIONS: Shrinkage of 3D-printed models is pervasive, arising regardless of the material used (PLA or resin) and independent of the brand or storage conditions. Consequently, immediate utilization of 3D printing for orthodontic appliance purposes may be preferable, with prolonged storage risking the manufacture of inaccurate orthodontic retainers and appliances.


Subject(s)
Orthodontic Retainers , Printing, Three-Dimensional , Software , Polyesters , Materials Testing
16.
BMC Bioinformatics ; 25(1): 155, 2024 Apr 20.
Article in English | MEDLINE | ID: mdl-38641616

ABSTRACT

BACKGROUND: Classification of binary data arises naturally in many clinical applications, such as patient risk stratification through ICD codes. One of the key practical challenges in data classification using machine learning is to avoid overfitting. Overfitting in supervised learning primarily occurs when a model learns random variations from noisy labels in training data rather than the underlying patterns. While traditional methods such as regularization and early stopping have demonstrated effectiveness in interpolation tasks, addressing overfitting in the classification of binary data, in which predictions always amount to extrapolation, demands extrapolation-enhanced strategies. One such approach is hybrid mechanistic/data-driven modeling, which integrates prior knowledge on input features into the learning process, enhancing the model's ability to extrapolate. RESULTS: We present NoiseCut, a Python package for noise-tolerant classification of binary data by employing a hybrid modeling approach that leverages solutions of defined max-cut problems. In a comparative analysis conducted on synthetically generated binary datasets, NoiseCut exhibits better overfitting prevention compared to the early stopping technique employed by different supervised machine learning algorithms. The noise tolerance of NoiseCut stems from a dropout strategy that leverages prior knowledge of input features and is further enhanced by the integration of max-cut problems into the learning process. CONCLUSIONS: NoiseCut is a Python package for the implementation of hybrid modeling for the classification of binary data. It facilitates the integration of mechanistic knowledge on the input features into learning from data in a structured manner and proves to be a valuable classification tool when the available training data is noisy and/or limited in size. This advantage is especially prominent in medical and biomedical applications where data scarcity and noise are common challenges. 
The codebase, illustrations, and documentation for NoiseCut are accessible for download at https://pypi.org/project/noisecut/. The implementation detailed in this paper corresponds to the version 0.2.1 release of the software.
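NoiseCut's learning process builds on solutions of defined max-cut problems. As a self-contained illustration of that subproblem (not the package's implementation), here is a brute-force maximum cut on a toy weighted graph.

```python
from itertools import product

def max_cut(n_nodes, edges):
    """Brute-force maximum cut of a small weighted graph.

    edges: list of (u, v, weight) tuples over nodes 0..n_nodes-1.
    Returns (best_cut_weight, partition), where partition[i] is 0 or 1.
    Exponential in n_nodes; suitable for toy sizes only.
    """
    best_weight, best_part = -1, None
    for part in product((0, 1), repeat=n_nodes):
        w = sum(weight for u, v, weight in edges if part[u] != part[v])
        if w > best_weight:
            best_weight, best_part = w, part
    return best_weight, best_part

# Triangle with one heavy edge: the best cut separates nodes 0 and 1
print(max_cut(3, [(0, 1, 5), (1, 2, 1), (0, 2, 1)]))
```

Max-cut is NP-hard in general; practical tools use solvers or problem structure rather than enumeration as above.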


Subject(s)
Algorithms , Software , Humans , Supervised Machine Learning , Machine Learning
17.
Genome Biol ; 25(1): 101, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38641647

ABSTRACT

Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real. Using biastools, we observe that more inclusive graph genomes result in fewer biased sites. We find that end-to-end alignment reduces bias at indels relative to local aligners. Finally, we use biastools to characterize how T2T references improve large-scale bias.
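One simple signal of reference bias that a tool in this space might examine is allele balance at known heterozygous sites: reads should support the reference and alternate alleles roughly equally. The sketch below is a hedged illustration, not biastools' actual categorization logic.

```python
def biased_sites(site_counts, tolerance=0.2):
    """Flag heterozygous sites whose reference-allele read share deviates
    from the expected 0.5 by more than `tolerance`.

    site_counts: dict site -> (ref_reads, alt_reads).
    """
    flagged = []
    for site, (ref, alt) in site_counts.items():
        total = ref + alt
        if total and abs(ref / total - 0.5) > tolerance:
            flagged.append(site)
    return flagged

# Toy read counts at three hypothetical heterozygous sites
sites = {"chr1:5000": (10, 10), "chr1:9000": (18, 2), "chr2:100": (6, 14)}
print(biased_sites(sites))  # ['chr1:9000']
```

Sites skewed toward the reference allele (like the 18-vs-2 site above) are the classic symptom that graph genomes and end-to-end alignment aim to reduce.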


Subject(s)
Genome , Genomics , Genomics/methods , Computational Biology , INDEL Mutation , Bias , Sequence Analysis, DNA/methods , Software , High-Throughput Nucleotide Sequencing/methods
18.
Database (Oxford) ; 2024, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38564425

ABSTRACT

Transcriptome profiling data, generated via RNA sequencing, are commonly deposited in public repositories. However, these data may not be easily accessible or usable by many researchers. To enhance data reuse, we present well-annotated, partially analyzed data via a user-friendly web application. This project involved transcriptome profiling of blood samples from 15 healthy pregnant women in a low-resource setting, taken at 6 consecutive time points beginning from the first trimester. Additional blood transcriptome profiles were retrieved from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) public repository, representing a cohort of healthy pregnant women from a high-resource setting. We analyzed these datasets using the fixed BloodGen3 module repertoire. We deployed a web application, accessible at https://thejacksonlaboratory.shinyapps.io/BloodGen3_Pregnancy/, which displays the module-level analysis results from both original and public pregnancy blood transcriptome datasets. Users can create custom fingerprint grid and heatmap representations via various navigation options, useful for reports and manuscript preparation. The web application serves as a standalone resource for exploring blood transcript abundance changes during pregnancy. Alternatively, users can integrate it with similar applications developed for earlier publications to analyze transcript abundance changes of a given BloodGen3 signature across a range of disease cohorts. Database URL: https://thejacksonlaboratory.shinyapps.io/BloodGen3_Pregnancy/.


Subject(s)
Pregnant Women , Transcriptome , Pregnancy , Humans , Female , Transcriptome/genetics , Software , Gene Expression Profiling , Databases, Genetic
19.
PLoS One ; 19(4): e0299585, 2024.
Article in English | MEDLINE | ID: mdl-38603718

ABSTRACT

The performance of a defect prediction model on balanced versus imbalanced datasets has a large impact on the discovery of future defects. Current resampling techniques only address the imbalance, without taking into consideration the redundancy and noise inherent to imbalanced datasets. To address the imbalance issue, we propose Kernel Crossover Oversampling (KCO), an oversampling technique based on kernel analysis and crossover interpolation. Specifically, the proposed technique aims to generate balanced datasets by increasing data diversity in order to reduce redundancy and noise. KCO first represents multidimensional features as two-dimensional features by employing Kernel Principal Component Analysis (KPCA). KCO then divides the plotted data distribution by deploying spectral clustering to select the best region for interpolation. Lastly, KCO generates the new defect data by interpolating different data templates within the selected data clusters. In the prediction evaluation conducted, KCO produced average F-scores ranging from 21% to 63% across six datasets. The experimental results show that KCO provides more effective prediction performance than the other baseline techniques, consistently achieving higher F-scores in both within-project and cross-project predictions.
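The crossover-interpolation core of such oversampling can be sketched as SMOTE-like linear interpolation between pairs of minority-class samples. KCO's KPCA projection and spectral-clustering region selection are omitted from this sketch.

```python
import random

def interpolate_samples(minority, n_new, seed=0):
    """Generate synthetic minority-class samples by linear interpolation
    between random pairs of existing samples (the crossover-interpolation
    core idea; KCO's KPCA and clustering steps are not modeled here).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Toy 2-feature minority class, oversampled with four synthetic points
minority = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
new = interpolate_samples(minority, n_new=4)
print(len(new))  # 4
```

Each synthetic point lies on the segment between two real minority samples, so the oversampled region stays inside the observed minority distribution.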


Subject(s)
Algorithms , Software , Cluster Analysis , Forecasting
20.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38561180

ABSTRACT

SUMMARY: Sequence technology advancements have led to an exponential increase in bacterial genomes, necessitating robust taxonomic classification methods. The Percentage Of Conserved Proteins (POCP), proposed initially by Qin et al. (2014), is a valuable metric for assessing prokaryote genus boundaries. Here, I introduce a computational pipeline for automated POCP calculation, aiming to enhance reproducibility and ease of use in taxonomic studies. AVAILABILITY AND IMPLEMENTATION: The POCP-nf pipeline uses DIAMOND for faster protein alignments, achieving similar sensitivity to BLASTP. The pipeline is implemented in Nextflow with Conda and Docker support and is freely available on GitHub under https://github.com/hoelzer/pocp. The open-source code can be easily adapted for various prokaryotic genome and protein datasets. Detailed documentation and usage instructions are provided in the repository.
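The POCP metric itself, as defined by Qin et al. (2014), is the share of proteins conserved between two genomes: POCP = (C1 + C2) / (T1 + T2) × 100, where Ci counts the proteins of genome i with a hit in the other genome (e-value < 1e-5, >40% identity, alignable over >50% of the query length). A sketch of the final arithmetic with hypothetical counts:

```python
def pocp(conserved1, total1, conserved2, total2):
    """Percentage of conserved proteins between two genomes:
    POCP = (C1 + C2) / (T1 + T2) * 100, where Ci is the number of
    proteins of genome i with a qualifying hit in the other genome."""
    return (conserved1 + conserved2) / (total1 + total2) * 100

# Hypothetical counts for two genomes; values above ~50% suggest the
# same genus under the original proposal
print(round(pocp(1800, 3000, 1700, 2800), 1))  # 60.3
```

The pipeline's contribution is automating the alignment and counting steps that produce C1, C2, T1, and T2; this final division is the easy part.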


Subject(s)
Prokaryotic Cells , Software , Reproducibility of Results , Genome, Bacterial